Method of segmenting an audio stream

ABSTRACT

Disclosed herein is a segmentation method that divides an input audio stream into segments containing different homogeneous signals. The main objective of this method is the localization of segments with stationary properties. The method seeks all non-stationary points or intervals in the audio stream and creates a list of segments. The obtained list of segments can be used as input data for subsequent procedures, such as classification, speech/music/noise attribution and so on. The proposed segmentation method is based on the analysis of the variation of the statistical features of the audio signal and comprises three main stages: a stage of first-grade characteristics calculation, a stage of second-grade characteristics calculation, and a stage of decision-making.

BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The present invention relates to a method of segmenting an audio data stream that is broadcast or recorded using some media, wherein the audio data stream is a sequence of digital samples or may be transformed into a sequence of digital samples. The goal of such segmentation is the division of the audio data stream into segments that correspond to different physical sources of the audio signal. In the case when some source(s) and some background source(s) emit an audio signal, the parameters of the background source(s) do not change substantially within one segment.

[0003] 2. Description of the Related Art

[0004] Audio and video recordings have become commonplace with the advent of consumer grade recording equipment. Unfortunately, both the audio and video streams provide few clues to assist in accessing the desired section of the record. In books, indexing is provided by the table of contents at the front and the index at the end, which readers can browse to locate authors and references to authors. A similar indexing scheme would be useful in an audio stream, to help locate sections where, for example, specific speakers were talking. The limited amount of data associated with most audio records does not provide enough information for confident and easy access to desired points of interest. The user therefore has to peruse the contents of a record in sequential order to retrieve the desired information.

[0005] As a solution to this problem, it is possible to use a system for automatic indexing of audio events in the audio data stream. The indexation process consists of two sequential parts: segmentation and classification. The segmentation process divides the audio stream into segments that are homogeneous in some sense. The classification process attributes appropriate labels to these segments. Thus, segmentation is the first and a very important stage of the indexation process. The present invention focuses on this problem.

[0006] Speech, music and noise (that is, non-speech and non-music) are commonly accepted as the basic audio events in an audio stream. Most attention worldwide has been given to speech detection, segmentation and indexation in audio streams such as broadcast news.

[0007] Broadcast news data arrive as long unsegmented speech streams, which not only contain speech with various speakers, backgrounds, and channels, but also contain a lot of non-speech audio information. It is therefore necessary to chop the long stream into smaller segments. It is also important to make these smaller segments homogeneous (each segment contains data from one source only), so that the non-speech information can be discarded, and segments from the same or a similar source can be clustered for speaker normalization and adaptation.

[0008] Zhan et al., “Dragon Systems' 1997 Mandarin Broadcast News System”, Proceedings of the Broadcast News Transcription and Understanding Workshop, Lansdowne, Va., pp. 25-27, 1998, produced the segments by looking for sufficiently long silence regions in the output of a coarse recognition pass. This method generated a considerable number of multi-speaker segments, and no speaker change information was used in the segmentation.

[0009] In the subsequent works, Wegmann et al., “Progress in Broadcast News Transcription at Dragon System”, Proceedings of ICASSP'99, Phoenix, Ariz., March, 1999, used the speaker change detection in the segmentation pass. The following is a procedure of their automatic segmentation:

[0010] An amplitude-based detector was used to break the input into chunks that are 20 to 30 seconds long.

[0011] These chunks were chopped into segments 2 to 30 seconds long, based on silences produced by a fast word recognizer.

[0012] These segments were further refined using a speaker change detector.

[0013] The patent of Balasubramanian et al., U.S. Pat. No. 5,606,643, enables retrieval based on indexing an audio stream of a recording according to the speaker. In particular, the audio stream may be segmented into speaker events, and each segment labeled with the type of event or the speaker identity. When speech from several individuals is intermixed, for example in conversational situations, the audio stream may be segregated into events according to speaker difference, with segments created by the same speaker identified or marked.

[0014] Creating an index in an audio stream, either in real time or in post-processing, may enable a user to locate particular segments of the audio data. For example, this may enable a user to browse a recording to select audio segments corresponding to a specific speaker, or “fast-forward” through a recording to the next speaker. In addition, knowing the ordering of speakers can also provide content clues about the conversation, or about the context of the conversation.

[0015] The ultimate goal of the segmentation is to produce a sequence of discrete segments with particular characteristics remaining constant within each one. The characteristics of choice depend on the overall structure of the indexation system.

[0016] Saunders, “Real-Time Discrimination of Broadcast Speech/Music”, Proc. ICASSP 1996, pp. 993-996, has described a speech/music discriminator based on zero-crossings. Its application is the discrimination between advertisements and programs in radio broadcasts. Since it is intended to be incorporated in consumer radios, it is designed to be low cost and simple. It is mainly designed to detect the characteristics of speech, which are described as limited bandwidth, alternating voiced and unvoiced sections, a limited range of pitch, syllabic duration of vowels, and energy variations between high and low levels. It indirectly uses the amplitude, pitch and periodicity estimates of the waveform to carry out the detection process, since zero-crossings give an estimate of the dominant frequency in the waveform.

[0017] Zue and Spina, “Automatic Transcription of General Audio Data: Preliminary Analyses”, Proc. ICSP 1996, pp. 594-597, use an average of the cepstral coefficients over a series of frames. This is shown to work well in distinguishing between speech and music when the speech is band-limited to 4 kHz and music to 16 kHz but less well when both signals occupied a 16 kHz bandwidth.

[0018] Scheirer and Slaney, “Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator”, Proc. ICASSP 1997, pp. 1331-1334, use a variety of features. These are: four hertz modulation energy, low energy, roll off of the spectrum, the variance of the roll off of the spectrum, the spectral centroid, variance of the spectral centroid, the spectral flux, variance of the spectral flux, the zero-crossing rate, variance of the zero-crossing rate, the cepstral residual, variance of the cepstral residual, and a pulse metric. The first two features are amplitude related. The next six features are derived from the fine spectrum of the input signal and are therefore related to the techniques described in the previous reference.

[0019] Carey et al., “A Comparison of Features for Speech, Music Discrimination”, Proc. IEEE 1999, pp. 149-152, use a variety of features. These are: cepstral coefficients, delta cepstral coefficients, amplitude, delta amplitude, pitch, delta pitch, zero-crossing rate, and delta zero-crossing rate. The pitch and cepstral coefficients encompass the fine and broad spectral features respectively. The zero-crossing parameters and the amplitude were believed worthy of investigation as a computationally inexpensive alternative to the other features.

SUMMARY OF THE INVENTION

[0020] The present invention provides a segmentation procedure that divides an input audio stream into segments having homogeneous acoustic characteristics. This audio stream is a sequence of digital samples, which are broadcast or recorded using some media.

[0021] An object of the invention is to provide a fast segmentation procedure with a relatively low numerical complexity.

[0022] The segmentation procedure comprises three stages: first-grade characteristic calculation, second-grade characteristic calculation, and decision-making. The stage of first-grade characteristic calculation calculates audio feature vectors from the input audio stream. These feature vectors define characteristics of the audio signal. The stage of second-grade characteristic calculation forms a sequence of statistical feature vectors from the sequence of audio feature vectors. The statistical feature vectors define statistical features of the first-grade features. The stage of decision-making analyzes the variation of the second-grade features and defines the segment boundaries based on that analysis.

[0023] Thus, an essential aim of the invention is to provide a segmentation method that, firstly, can be used for a wide variety of applications and, secondly, can be manufactured on an industrial scale, based on the development of one relatively simple integrated circuit.

[0024] Other aspects of the present invention can be seen upon review of the figure, the detailed description and the claims, which follow.

BRIEF DESCRIPTION OF THE DRAWING

[0025] The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate an embodiment of the invention and together with the description serve to explain the principle of the invention. In the drawings:

[0026] FIG. 1 is a block diagram of a generalized audio processing system within which the present invention may be embodied;

[0027] FIG. 2 is a generalized flow diagram of the audio segmentation procedure;

[0028] FIG. 3 is a detailed flow diagram of the audio segmentation procedure;

[0029] FIG. 4 illustrates a flowchart of the sub-stage of initial segmentation;

[0030] FIG. 5 illustrates a flowchart of the sub-stage of accurate segmentation;

[0031] FIG. 6 shows the improvement of the positions of the dividing markers;

[0032] FIG. 7 shows the definition of the homogeneous interval inside the segment.

DETAILED DESCRIPTION OF THE INVENTION

[0033] FIG. 1 is a block diagram of a generalized audio processing system 1, within which the present invention may be embodied. Generally, an audio stream is provided from a source of audio data 2, which may be a recorded broadcast, a recorded video with an accompanying audio track, or another audio source. The audio data is sent to an audio processor 3, which may be any well-known device, such as a general purpose computer, configured according to the present invention. The audio processor outputs segments of the audio data 4.

[0034] FIG. 2 is a generalized flow diagram of an audio segmentation procedure 5. Box 10 is the audio stream input, for example, broadcast news input. The step in box 20 calculates audio feature vectors from the audio stream. These feature vectors define characteristic features of the audio stream. The next step 30 forms a sequence of statistical feature vectors from the sequence of audio feature vectors. The statistical feature vectors define statistical characteristics of the audio feature vectors. At the step 40, the variation of the statistical feature vectors is analyzed and the segment boundaries are defined based on that analysis. Thus, the proposed segmentation procedure is based on the analysis of the variation of the statistical features of the audio signal. The resulting index segmentation of the audio stream is output at the step 50.

[0035] FIG. 3 is a detailed flow diagram of the audio segmentation procedure. After the input of the audio stream data 10, the input sequence of digital samples is divided into a sequence of short (e.g. 10-20 ms) non-overlapping frames 21. At the step 25, a feature vector is computed for each frame. This computation is performed using 10th-order Linear Predictive Coding (LPC) analysis of the samples in possibly overlapping windows that contain said frames.
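The framing step 21 can be illustrated with a minimal Python sketch. The function name `split_into_frames`, the 8 kHz sample rate used in the example, and the choice to drop a trailing partial frame are assumptions made for illustration only; the description does not specify them.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Split samples into consecutive non-overlapping frames of frame_ms milliseconds.

    A trailing partial frame is dropped (an assumption; the text does not
    say how the end of the stream is handled).
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Toy input: 1000 samples at an assumed 8 kHz rate, 20 ms frames.
frames = split_into_frames(list(range(1000)), sample_rate_hz=8000, frame_ms=20)
print(len(frames), len(frames[0]))  # 6 160
```

The LPC analysis window for each frame may extend beyond the frame itself, as noted above; this sketch covers only the division into frames.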

[0036] The parameters of the autoregressive linear model, which is the foundation of LPC analysis, are reliable and may be determined with relatively small computational complexity. The following parameters form the coordinates of the audio feature vector:

[0037] $\Lambda_i,\ i = 1, \ldots, 5$: Formant Frequencies;

[0038] K¹, K²: First and Second Reflection Coefficients;

[0039] E⁰: Energy of the Prediction Error Coefficient;

[0040] E¹: Preemphasized Energy Ratio Coefficient.
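As an illustration of how reflection coefficients and a prediction error energy arise from LPC analysis, the following Python sketch uses the autocorrelation method with the Levinson-Durbin recursion. This is a swapped-in textbook technique, not necessarily the estimator used by the invention; the function names and the sign convention of the reflection coefficients are assumptions.

```python
def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for autocorrelation values r[0..order].

    Returns (a, reflection, error): predictor polynomial a with a[0] == 1,
    the list of reflection coefficients, and the final prediction error energy.
    Sign convention (an assumption): error e[t] = x[t] + sum_i a[i] * x[t - i].
    """
    a = [1.0] + [0.0] * order
    error = float(r[0])
    reflection = []
    for m in range(1, order + 1):
        if error <= 0.0:
            # Degenerate (e.g. silent) frame: pad with zeros and stop.
            reflection.extend([0.0] * (order - m + 1))
            break
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / error
        reflection.append(k)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        error *= 1.0 - k * k
    return a, reflection, error


# Toy autocorrelation sequence; a real frame would use autocorrelation(frame, 10)
# and order=10, matching the 10th-order analysis in the description.
_, k, e0 = levinson_durbin([1.0, 0.5, 0.25], order=2)
k1, k2 = k[0], k[1]
print(k1, k2, e0)
```

In this sketch `k1`, `k2` and `e0` play the roles of K¹, K² and E⁰ for one frame.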

[0041] The parameters K¹, K², E⁰ are calculated simultaneously with the LPC analysis, according to Marple, Jr., “Digital Spectral Analysis”, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1987. After the LPC analysis, 10 Line Spectral Pairs (LSP) coefficients are computed according to U.S. Pat. No. 4,393,272 or ITU-T, Study Group 15 Contribution--Q. 12/15, Draft Recommendation G.729, Jun. 8, 1995, Version 5.0. The formant frequencies $\Lambda_i,\ i = 1, \ldots, 5$ are calculated as half of the sum of the corresponding LSP coefficients. E¹ is the ratio of the energy of the 6-dB preemphasized first-order difference audio signal to the energy of the regular audio signal, according to Campbell et al., “Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm”, Proceedings ICASSP '86, April, Tokyo, Japan, V.1, pp. 473-476.
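The two derived features can be sketched in Python as follows. The sketch assumes that the "corresponding" LSP coefficients are the consecutive pairs (1st and 2nd, 3rd and 4th, and so on) and that the preemphasis is a plain first-order difference with coefficient 1; neither detail is stated explicitly in the description, and the function names are hypothetical.

```python
def formants_from_lsp(lsp):
    """Half-sum of consecutive LSP pairs (pairing is an assumption)."""
    if len(lsp) != 10:
        raise ValueError("expected 10 LSP coefficients")
    return [(lsp[2 * i] + lsp[2 * i + 1]) / 2.0 for i in range(5)]


def preemphasized_energy_ratio(frame):
    """Energy of the first-order difference signal divided by the signal energy.

    Returns 0.0 for an all-zero frame (an assumption to avoid division by zero).
    """
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    diff_energy = sum((frame[t] - frame[t - 1]) ** 2 for t in range(1, len(frame)))
    return diff_energy / energy


# A constant frame has no high-frequency content; an alternating one has a lot.
print(preemphasized_energy_ratio([1.0, 1.0, 1.0, 1.0]))    # 0.0
print(preemphasized_energy_ratio([1.0, -1.0, 1.0, -1.0]))  # 3.0
```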

[0043] As a result, the audio feature vectors are obtained (9 characteristics in all). These vectors have a definite physical meaning and a dynamic range sufficient for precise segmentation of the audio stream. The next task of the segmentation procedure is the statistical analysis of the obtained data. The statistical characteristics are calculated in non-overlapping second-grade windows, each of which consists of some predefined number of frames (e.g. 20-100 frames in one window). Thus, each such window is described by a number of first-grade characteristic vectors. The division of the input sequence of audio feature vectors into windows is performed at the step 31. At the step 35, the sequence of those vectors is transformed into the statistical feature vectors.

[0044] The statistical feature vector $\vec{V}$ consists of two sub-vectors. The first of them consists of the following coordinates, where $\Lambda_{j}^{(i)}$ denotes the $j$-th formant frequency in the $i$-th frame of the window: $\begin{matrix} V_{j} = \frac{1}{M} \sum\limits_{i=1}^{M} \Lambda_{j}^{(i)}, \quad j = 1, \ldots, 5 \\ V_{j+5} = \frac{1}{M} \sum\limits_{i=1}^{M} \left( \Lambda_{j}^{(i)} - V_{j} \right)^{2}, \quad j = 1, \ldots, 5 \end{matrix}$

[0045] and the second of these sub-vectors consists of: $\begin{matrix} V_{11} = \left( \max\limits_{i=1,\ldots,M} \left\{ K_{i}^{2} \right\} - \min\limits_{i=1,\ldots,M} \left\{ K_{i}^{2} \right\} \right) \times \frac{1}{M} \sum\limits_{i=1}^{M} K_{i}^{2} \\ V_{12} = \frac{1}{M} \sum\limits_{i=1}^{M} E_{i}^{0} \times \frac{1}{M} \sum\limits_{i=1}^{M} \left( E_{i}^{0} - \frac{1}{M} \sum\limits_{i=1}^{M} E_{i}^{0} \right)^{2} \\ V_{13} = \sum\limits_{i=2}^{M} \left| E_{i}^{0} - E_{i-1}^{0} \right| \Big/ \sum\limits_{i=1}^{M} \left| E_{i}^{0} \right| \\ V_{14} = \max\limits_{i=1,\ldots,M} \left\{ E_{i}^{1} \right\} - \min\limits_{i=1,\ldots,M} \left\{ E_{i}^{1} \right\} \\ V_{15} = \sum\limits_{i=1}^{M} B(i), \quad \text{where} \quad B(i) = \begin{cases} 1, & K_{i}^{1} > \text{predefined threshold} \\ 0, & \text{otherwise} \end{cases} \end{matrix}$,

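[0046] where M is the number of frames in one window, and the superscripts in $K_{i}^{1}$, $K_{i}^{2}$, $E_{i}^{0}$, $E_{i}^{1}$ are the parameter labels K¹, K², E⁰, E¹ for frame $i$, not exponents.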
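The calculation of one 15-coordinate statistical feature vector can be sketched in Python as below. The per-frame dictionary layout, the function name `statistical_vector` and the parameter `k1_threshold` are assumptions introduced for the example. V13 is computed as a ratio, following the wording of claim 8 ("divided by the sum of the modules"), and is set to 0 when the denominator is zero, which is a further assumption.

```python
def statistical_vector(frames, k1_threshold):
    """Build the 15-coordinate vector for one second-grade window.

    frames: list of M dicts with keys 'formants' (5 floats), 'k1', 'k2',
    'e0', 'e1' (hypothetical layout chosen for this sketch).
    """
    m = len(frames)
    # First sub-vector: means (V1..V5) and dispersions (V6..V10) of the formants.
    means = [sum(f["formants"][j] for f in frames) / m for j in range(5)]
    variances = [sum((f["formants"][j] - means[j]) ** 2 for f in frames) / m
                 for j in range(5)]
    k2 = [f["k2"] for f in frames]
    e0 = [f["e0"] for f in frames]
    e1 = [f["e1"] for f in frames]
    mean_e0 = sum(e0) / m
    var_e0 = sum((x - mean_e0) ** 2 for x in e0) / m
    abs_sum_e0 = sum(abs(x) for x in e0)
    # Second sub-vector: V11..V15.
    v11 = (max(k2) - min(k2)) * (sum(k2) / m)
    v12 = mean_e0 * var_e0
    v13 = (sum(abs(e0[i] - e0[i - 1]) for i in range(1, m)) / abs_sum_e0
           if abs_sum_e0 else 0.0)
    v14 = max(e1) - min(e1)
    v15 = sum(1 for f in frames if f["k1"] > k1_threshold)
    return means + variances + [v11, v12, v13, v14, v15]


window = [
    {"formants": [1.0, 2.0, 3.0, 4.0, 5.0], "k1": 0.9, "k2": 0.2, "e0": 1.0, "e1": 0.5},
    {"formants": [3.0, 4.0, 5.0, 6.0, 7.0], "k1": 0.1, "k2": 0.4, "e0": 3.0, "e1": 1.5},
]
v = statistical_vector(window, k1_threshold=0.5)
print(len(v))  # 15
```

A real window would contain 20-100 frames; two frames are used here only to keep the arithmetic easy to follow.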

[0047] As a result, the statistical feature vectors are obtained (15 characteristics in all).

[0048] The sub-stages of the decision-making 40 will be discussed in more detail below, but FIG. 3 serves to give an overview of the method described by the invention.

[0049] The sub-stage of initial segmentation 100 is performed in such a way that the dividing markers, which correspond to the boundaries of segments, are determined with an accuracy of one second-grade window. The sub-stage of accurate segmentation 200 improves the precision of the segmentation produced by the previous step: it corrects the position of each dividing marker with an accuracy of one frame and eliminates false segments. The sub-stage of internal markers definition 300 determines a stationary interval inside each segment. The resulting sequence of non-intersecting audio segments with their time boundaries is output at the step 50.

[0050] Sub-Stage of Initial Segmentation

[0051] FIG. 4 illustrates a flowchart of the sub-stage of initial segmentation 100 of FIG. 3. In this sub-stage, the statistical feature vectors $\vec{V}[k],\ k = 1, \ldots, K$ are analyzed. On each step, the algorithm of this sub-stage examines four sequential input vectors. The result of the analysis is the position at which the dividing marker is placed.

[0052] Let $\vec{V}[k], \vec{V}[k+1], \vec{V}[k+2], \vec{V}[k+3]$ be four sequential statistical feature vectors, which are taken 136 from the set of sequential statistical feature vectors.

[0053] The differences $A_{j}^{i} = \left| V_{i}[k+j] - V_{i}[k+j+1] \right|,\ j = 0, 1, 2,\ i = 1, \ldots, 10$

[0054] are calculated for the first sub-vectors of the statistical feature vectors 137. If at least one of these values is greater than the corresponding predefined threshold 138, the dividing marker is installed between the second-grade windows 139. In this case, the remaining steps of this sub-stage are not performed, and the next four vectors, the first of which is the first vector after the installed dividing marker, are taken from the set of sequential statistical feature vectors for analysis 148.

[0055] Otherwise, the differences $A^{i} = \left| (V_{i}[k] + V_{i}[k+1]) - (V_{i}[k+2] + V_{i}[k+3]) \right|,\ i = 11, \ldots, 15$ are calculated 140 for the second sub-vectors of the statistical feature vectors. These values are compared with the predefined thresholds 141. The case when all of these values are smaller than the corresponding threshold values corresponds to the absence of the dividing marker 142. In this case, the remaining steps of this sub-stage are not performed, and the next four vectors, the first of which is the vector $\vec{V}[k+1]$, are taken from the set of sequential statistical feature vectors for analysis 148. Otherwise, the differences $A_{j}^{i} = \left| V_{i}[k+j] - V_{i}[k+j+1] \right|,\ i = 11, \ldots, 15,\ j = 0, 1, 2$ are calculated 143 for the second sub-vectors of the statistical feature vectors. If at least one of these values is greater than the corresponding predefined threshold 144, then the dividing marker is installed between the second-grade windows 145. In this case, the remaining steps of this sub-stage are not performed, and the next four vectors, the first of which is the first vector after the installed dividing marker, are taken from the set of sequential statistical feature vectors 148. Otherwise, the next four vectors, the first of which is the vector $\vec{V}[k+1]$, are taken from the set of sequential statistical feature vectors for analysis 148. If the dividing marker is taken at the step in diamond 147, then the sub-stage of initial segmentation ends and the initial segmentation marker passes to the sub-stage of accurate segmentation.
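The three-step decision of this sub-stage can be sketched in Python as follows. Several details are assumptions made for the sketch: the three threshold lists `t1`, `t2`, `t3` are kept separate for the three steps, the marker is placed after the first window pair whose difference exceeds its threshold, and all markers are collected in one pass, whereas the described method hands each marker to the accurate segmentation sub-stage before continuing.

```python
def find_marker(v, k, t1, t2, t3):
    """Analyze v[k..k+3]; return j if a marker falls between windows k+j and
    k+j+1 (j in 0..2), or None if no marker is found."""
    # Step 1: pairwise differences on the first sub-vector (coordinates 1..10).
    for j in range(3):
        if any(abs(v[k + j][i] - v[k + j + 1][i]) > t1[i] for i in range(10)):
            return j
    # Step 2: summed differences on the second sub-vector (coordinates 11..15).
    summed = [abs((v[k][i] + v[k + 1][i]) - (v[k + 2][i] + v[k + 3][i]))
              for i in range(10, 15)]
    if all(s < t for s, t in zip(summed, t2)):
        return None
    # Step 3: pairwise differences on the second sub-vector.
    for j in range(3):
        if any(abs(v[k + j][i] - v[k + j + 1][i]) > t3[i - 10]
               for i in range(10, 15)):
            return j
    return None


def initial_segmentation(v, t1, t2, t3):
    """Return window indices w such that a marker sits just before window w."""
    markers = []
    k = 0
    while k + 3 < len(v):
        j = find_marker(v, k, t1, t2, t3)
        if j is None:
            k += 1                      # slide by one window
        else:
            markers.append(k + j + 1)   # marker between windows k+j and k+j+1
            k += j + 1                  # restart right after the marker
    return markers


# Toy data: the first coordinate jumps between window 3 and window 4.
low = [0.0] * 15
high = [10.0] + [0.0] * 14
vectors = [low] * 4 + [high] * 4
print(initial_segmentation(vectors, [1.0] * 10, [1.0] * 5, [1.0] * 5))  # [4]
```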

[0056] Sub-Stage of Accurate Segmentation

[0057] FIG. 5 illustrates a flowchart of the sub-stage of accurate segmentation 200 of FIG. 3. The purpose of this sub-stage is to improve the positions of the dividing markers. This is achieved by a precise statistical analysis of the sequence of formant frequency coefficients $\Lambda_i,\ i = 1, \ldots, 5$ close to each dividing marker (see FIG. 6). Consider an arbitrary dividing marker μ together with a neighborhood of formant frequency coefficients, which consists of n frames. At the step in box 210, the difference is evaluated: $S_{j} = \left| \frac{1}{j+1} \sum\limits_{p=0}^{j} \sum\limits_{i=1}^{5} \Lambda_{i}^{(k+p)} - \frac{1}{n-j} \sum\limits_{p=j+1}^{n} \sum\limits_{i=1}^{5} \Lambda_{i}^{(k+p)} \right|, \quad j = a, \ldots, n-a-1, \quad a < \frac{n}{2}$

[0058] The argument that corresponds to the maximum value of $S_{j}$ is calculated at the step 220: $J = \underset{j = a, \ldots, n-a-1}{\arg\max} \left( S_{j} \right)$

[0059] At the step 230, the new dividing marker μ is placed at the position corresponding to this J, between the shaded rectangles in FIG. 6. At the step in box 148 in FIG. 4, the shift of vectors is performed from the position of the new marker μ.
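The marker refinement can be sketched in Python as below. The sketch assumes that the neighborhood is passed as a list of per-frame formant vectors for frames k, ..., k+n (so the list has n+1 entries, matching the summation limits p = 0, ..., n), that a is a non-negative margin, and that ties are resolved in favor of the smallest j; the function name `refine_marker` is hypothetical.

```python
def refine_marker(formants, a):
    """Return J, the split index maximizing S_j over j = a, ..., n - a - 1.

    formants: list of 5-element lists for frames k, ..., k + n.
    The refined boundary lies between frame k + J and frame k + J + 1.
    """
    sums = [sum(f) for f in formants]      # sum over the 5 formants per frame
    n = len(sums) - 1
    if not 0 <= a < n / 2:
        raise ValueError("margin a must satisfy 0 <= a < n/2")
    best_j, best_s = None, -1.0
    for j in range(a, n - a):              # j = a, ..., n - a - 1
        left = sum(sums[:j + 1]) / (j + 1)     # mean over p = 0..j
        right = sum(sums[j + 1:]) / (n - j)    # mean over p = j+1..n
        s = abs(left - right)
        if s > best_s:
            best_j, best_s = j, s
    return best_j


# Toy neighborhood: six "low" frames followed by four "high" frames.
neighborhood = [[1.0] * 5] * 6 + [[3.0] * 5] * 4
print(refine_marker(neighborhood, a=2))  # 5
```

In the toy example the largest difference of means occurs when the split falls exactly at the change between the sixth and seventh frames, so J = 5.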

[0060] Sub-Stage of Internal Markers Definition

[0061] The sub-stage of internal markers definition of the final segmentation analyzes each segment in order to define two internal markers ($\mu^{int}$, $\eta^{int}$) that delimit the most homogeneous interval inside the segment. This is done for the following reason: the placed dividing markers separate two audio events of different nature. These events, as a rule, pass smoothly from one to another and do not have a sharp border. Therefore, there is a time interval containing information about both events, which may hamper their correct classification.

[0062] As at the previous sub-stage, this task is solved by a precise statistical analysis of a sequence of formant frequency coefficients $\Lambda_i,\ i = 1, \ldots, 5$ close to each dividing marker. Consider an arbitrary segment limited by markers μ and η (so that η − μ = n + 1 frames) and composed of formant frequency coefficients (see FIG. 7).

[0063] Firstly, two differences are evaluated: $\begin{matrix} S_{1j} = \left| \frac{1}{j+1} \sum\limits_{p=0}^{j} \sum\limits_{i=1}^{5} \Lambda_{i}^{(k+p)} - \frac{1}{n/2-j} \sum\limits_{p=j+1}^{n/2} \sum\limits_{i=1}^{5} \Lambda_{i}^{(k+p)} \right|, \quad j = a, \ldots, n/2-a-1, \quad a < \frac{n}{4} \\ S_{2j} = \left| \frac{1}{j+1} \sum\limits_{p=n/2}^{j+n/2} \sum\limits_{i=1}^{5} \Lambda_{i}^{(k+p)} - \frac{1}{n/2-j} \sum\limits_{p=j+1+n/2}^{n} \sum\limits_{i=1}^{5} \Lambda_{i}^{(k+p)} \right|, \quad j = a, \ldots, n/2-a-1, \quad a < \frac{n}{4} \end{matrix}$

[0064] Secondly, the arguments that correspond to the maximum values of $S_{1j}$ and $S_{2j}$ are calculated: $J_{1} = \underset{j = a, \ldots, n/2-a-1}{\arg\max} \left( S_{1j} \right), \quad J_{2} = \underset{j = a, \ldots, n/2-a-1}{\arg\max} \left( S_{2j} \right).$

[0065] Then, the new markers $\mu^{int}$ and $\eta^{int}$ are placed at the positions corresponding to these $J_{1}$, $J_{2}$, between the shaded rectangles in FIG. 7.
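The definition of the two internal markers can be sketched in Python by applying the same split search to each half of the segment. The sketch assumes that n is even, that the middle frame n/2 belongs to both halves (as the summation limits of the two differences suggest), and that the results are reported as frame offsets from the start of the segment, J₁ and n/2 + J₂; the function names are hypothetical.

```python
def best_split(sums, a):
    """Argmax over j = a, ..., h - a - 1 of |mean(sums[0..j]) - mean(sums[j+1..h])|,
    where h = len(sums) - 1."""
    h = len(sums) - 1
    best_j, best_s = None, -1.0
    for j in range(a, h - a):
        left = sum(sums[:j + 1]) / (j + 1)
        right = sum(sums[j + 1:]) / (h - j)
        s = abs(left - right)
        if s > best_s:
            best_j, best_s = j, s
    return best_j


def internal_markers(formants, a):
    """Return frame offsets (relative to the segment start) of the two
    internal markers for a segment of n + 1 frames."""
    sums = [sum(f) for f in formants]      # sum over the 5 formants per frame
    n = len(sums) - 1
    if n % 2:
        raise ValueError("this sketch assumes an even n")
    if not 0 <= a < n / 4:
        raise ValueError("margin a must satisfy 0 <= a < n/4")
    half = n // 2
    j1 = best_split(sums[:half + 1], a)    # S_1j over frames 0..n/2
    j2 = best_split(sums[half:], a)        # S_2j over frames n/2..n
    return j1, half + j2


# Toy segment of 13 frames: a short transition at each end, stable in between.
per_frame = [0, 0, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20]
segment = [[float(s), 0.0, 0.0, 0.0, 0.0] for s in per_frame]
print(internal_markers(segment, a=1))  # (1, 10)
```

In the toy segment the returned offsets bracket frames 2 to 10, which are the frames with constant formant sums.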

[0066] Thus, the process of segmentation is ended. As a result, a sequence of non-intersecting audio intervals with their time boundaries is obtained.

What is claimed is:
 1. A method of segmentation of an audio stream, wherein the segmentation is the division of the audio stream into segments containing different homogeneous signals.
 2. The method according to claim 1, wherein the audio stream is a sequence of digital samples which are broadcast or recorded using some media.
 3. The method according to claim 1, wherein the audio stream segmentation is performed in three stages: the stage of the first-grade characteristic calculation, the stage of the second-grade characteristic calculation, and the stage of the decision-making analysis.
 4. The method according to claim 3, wherein the stage of the first-grade characteristic calculation is performed by the division of the audio stream into frames, for each of which the audio feature vector is calculated.
 5. The method according to claim 4, wherein the audio feature vector consists of five formant frequencies, the first and the second reflection coefficients, the energy of the prediction error coefficient, and the preemphasized energy ratio coefficient.
 6. The method according to claim 3, wherein the stage of the second-grade characteristic calculation is performed in the sequence of predefined non-overlapping windows, each of which consists of a definite number of said frames with said audio feature vectors calculated at the stage of the first-grade characteristic calculation.
 7. The method according to claim 6, wherein the stage of the second-grade characteristic calculation consists in the calculation of the statistical feature vector for each said window.
 8. The method according to claim 7, wherein the statistical feature vector consists of two sub-vectors, the first of which consists of the mean values of the formant frequencies and the dispersions of the formant frequencies, and the second of which consists of the difference between the maximal and minimal values of the second reflection coefficient multiplied by the mean value of the second reflection coefficient, the product of the mean value and the dispersion of the energy of the prediction error coefficient, the sum of the absolute values of the differences between said energies of the prediction error coefficients for neighboring frames divided by the sum of the absolute values of said energies of the prediction error coefficients, the difference between the maximal and minimal values of said preemphasized energy ratio coefficients, and the number of said frames in the window in which the first reflection coefficient exceeds a predefined threshold value.
 9. The method according to claim 3, wherein the stage of the decision-making analysis is performed in three sub-stages: the sub-stage of initial segmentation, the sub-stage of accurate segmentation, and the sub-stage of the internal markers definition.
 10. The method according to claim 9, wherein the sub-stage of initial segmentation is performed based on the analysis of four sequential statistical feature vectors to define where the dividing marker has to be placed.
 11. The method according to claim 10, wherein said analysis of the four sequential statistical feature vectors is performed in three steps, the first of which may signal the installation of the dividing marker, in which case the other steps are not performed, the second of which may signal the absence of the dividing marker, in which case the third of said steps is not performed, and the third of which signals the absence or the installation of the dividing marker.
 12. The method according to claim 11, wherein the first of said steps includes calculation of the absolute values of differences between two sequential coordinates of the first said sub-vector of the statistical feature vectors, comparison of the calculated values with the predefined threshold values, and the installation of the dividing marker if at least one said absolute value is greater than the corresponding threshold value.
 13. The method according to claim 11, wherein the second of said steps includes calculation of the absolute values of differences between the sum of two sequential coordinates of the second said sub-vector of the statistical feature vectors and the sum of the next two sequential coordinates of the second said sub-vector of the statistical feature vectors, comparison of the calculated values with the predefined threshold values, and signaling the absence of the dividing marker in the case when all of said absolute values are smaller than the corresponding threshold values.
 14. The method according to claim 11, wherein the third of said steps includes calculation of the absolute values of differences between two sequential coordinates of the second said sub-vector of the statistical feature vectors, comparison of the calculated values with the predefined threshold values, the installation of the dividing marker if at least one said absolute value is greater than the corresponding threshold value, and signaling the absence of the dividing marker in the opposite case.
 15. The method according to claim 9, wherein the sub-stage of accurate segmentation is performed based on the results of the initial segmentation and the analysis of the sequence of said formant frequencies calculated for frames close to the dividing marker.
 16. The method according to claim 15, wherein the analysis is based on calculation of the set of absolute values of differences between sums of mean values of the formant frequencies, wherein each absolute value is calculated for the two intervals close to the dividing marker.
 17. The method according to claim 16, wherein the intervals are sequential non-overlapping intervals, each of which has one and only one fixed border, and the border between said intervals is varied when the calculation of the set of absolute values is performed.
 18. The method according to claim 16, wherein the result of the accurate segmentation is the new position of the dividing marker, which corresponds to the maximum of the absolute values of differences between sums of mean values of said formant frequencies.
 19. The method according to claim 9, wherein the goal of the sub-stage of the internal markers definition is the definition of two internal markers, which determine the most homogeneous interval inside each segment that was obtained at the sub-stage of accurate segmentation.
 20. The method according to claim 19, wherein the two internal markers correspond to the left and right halves of the segment and are calculated independently using calculation of the set of absolute values of differences between sums of mean values of the formant frequencies for each half of the segment.
 21. The method according to claim 20, wherein the absolute values of differences are calculated for two non-overlapping intervals, which together cover the corresponding half of said segment, and the border between these two intervals is varied when the calculation of the set of absolute values is performed.
 22. The method according to claim 21, wherein each of the internal markers corresponds to the maximum of the absolute values of differences.